Getting started with metadata exploration
…BRIEF INTRO IN PROGRESS…
Snakemake workflow for exploring sample metadata
A tentative Snakemake workflow that defines rules for downloading and exploring sample metadata as a DAG (directed acyclic graph). A detailed interactive Snakemake report is available here. Use a wider screen for a better view of the interactive report. The interactive map of project sampling points is available here.
What is metadata?
- Metadata is a set of data that describes and provides information about other data. It is commonly defined as data about data.
- Sample metadata described here refers to the description and context of the individual sample collected for a specific microbiome study.
Metadata structure
- Metadata collected at different stages are typically organized in an Excel or Google spreadsheet where:
- The metadata table columns represent the properties of the samples.
- The table rows contain information associated with the samples.
- Typically, the first column of sample metadata is Sample ID, which is the key that identifies each individual sample.
- Sample ID must be unique.
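The uniqueness rule is easy to check before any downstream analysis. The following is a minimal sketch using only the Python standard library; the column name `Sample ID`, the function name `find_duplicate_ids`, and the three-row table are illustrative assumptions, not part of the workflow.

```python
import csv
import io
from collections import Counter


def find_duplicate_ids(csv_text, id_column="Sample ID"):
    """Return the sample IDs that occur more than once, sorted alphabetically."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter(row[id_column].strip() for row in reader)
    return sorted(sample_id for sample_id, n in counts.items() if n > 1)


# Hypothetical metadata table: the first column is the key,
# the remaining columns are properties of each sample.
example = """Sample ID,host,body_site
S1,mouse,gut
S2,mouse,gut
S2,mouse,skin
"""

print(find_duplicate_ids(example))  # ['S2']
```

An empty list means every Sample ID is unique and the table can safely be used as a key for joining with experimental data.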
Embedded metadata
- In most cases, you will find the metadata stored separately from the experimental data.
- Embedded metadata is integrated with the experimental data, which is especially useful for graphics.
- Major microbiome analysis platforms require sample metadata, commonly referred to as a mapping file, for downstream analysis.
Downloading NCBI-SRA metadata
Several methods exist for downloading sample metadata deposited in the Sequence Read Archive (SRA) or the European Nucleotide Archive (ENA). Each method yields slightly different information, so it is a good habit to explore which one suits you best. For this demo, we will explore sample metadata retrieved from five randomly selected microbiome BioProjects:
- PRJNA477349: 16S rRNA from bushmeat samples collected from Tanzania Metagenome (Multispecies).
- PRJNA802976: Changes to Gut Microbiota following Systemic Antibiotic Administration in Infants (Multispecies).
- PRJNA322554: The Early Infant Gut Microbiome Varies In Association with a Maternal High-fat Diet (Multispecies).
- PRJNA937707: Microbiome associated with spotting disease in the purple sea urchin (Multispecies).
- PRJNA589182: 16S rDNA gene sequencing of the phyllosphere endophytic bacterial communities colonizing wild (Multispecies).
Manually via SRA Run Selector
We can manually retrieve metadata from the SRA database via the SRA Run Selector.
- Note that the SRA metadata file is automatically named SraRunTable.txt.
- Users can change the default TXT extension to another one, such as CSV, if preferred.
- In our demo, we will use CSV and save the metadata file in the data/metadata/ folder.
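The renaming step can also be scripted. This is a small sketch, assuming the download is a comma-separated file called `SraRunTable.txt` and that one copy per BioProject is kept as `<accession>_SraRunTable.csv` in `data/metadata/`; the helper name `save_run_table` is hypothetical.

```python
import shutil
from pathlib import Path


def save_run_table(download, bioproject, out_dir="data/metadata"):
    """Copy a downloaded SraRunTable.txt to <out_dir>/<bioproject>_SraRunTable.csv."""
    target_dir = Path(out_dir)
    target_dir.mkdir(parents=True, exist_ok=True)  # create the folder if missing
    target = target_dir / f"{bioproject}_SraRunTable.csv"
    shutil.copyfile(download, target)  # the original download is left untouched
    return target


# Example call (requires the file to exist in the working directory):
# save_run_table("SraRunTable.txt", "PRJNA477349")
```

Copying rather than renaming keeps the original download available in case a step needs to be repeated.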
Example screenshot of the SRA Run Selector showing metadata associated with the NCBI-SRA BioProject PRJNA477349
Computationally via Entrez Direct
#!/bin/bash
esearch -db sra -query 'PRJNA477349[bioproject]' | efetch -format runinfo >data/metadata/runinfo_PRJNA477349_metadata.csv;
esearch -db sra -query 'PRJNA802976[bioproject]' | efetch -format runinfo >data/metadata/runinfo_PRJNA802976_metadata.csv;
esearch -db sra -query 'PRJNA322554[bioproject]' | efetch -format runinfo >data/metadata/runinfo_PRJNA322554_metadata.csv;
esearch -db sra -query 'PRJNA937707[bioproject]' | efetch -format runinfo >data/metadata/runinfo_PRJNA937707_metadata.csv;
esearch -db sra -query 'PRJNA589182[bioproject]' | efetch -format runinfo >data/metadata/runinfo_PRJNA589182_metadata.csv;
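The five commands above differ only in the accession, so they can be generated from a list. The sketch below builds the same `esearch | efetch` pipeline as a string and runs it through the shell; the function names `runinfo_command` and `download_all` are illustrative, and running `download_all` assumes Entrez Direct is installed and on the PATH.

```python
import subprocess
from pathlib import Path

BIOPROJECTS = ["PRJNA477349", "PRJNA802976", "PRJNA322554",
               "PRJNA937707", "PRJNA589182"]


def runinfo_command(accession, out_dir="data/metadata"):
    """Build the esearch | efetch pipeline for one BioProject accession."""
    out_file = f"{out_dir}/runinfo_{accession}_metadata.csv"
    return (
        f"esearch -db sra -query '{accession}[bioproject]' "
        f"| efetch -format runinfo > {out_file}"
    )


def download_all(accessions, out_dir="data/metadata"):
    """Run the pipeline once per accession (needs network access)."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for accession in accessions:
        # check=True only sees the exit status of the last command in the pipe
        subprocess.run(runinfo_command(accession, out_dir), shell=True, check=True)


# Example call, not executed here: download_all(BIOPROJECTS)
```

Keeping the accessions in one list makes it easier to add or drop a BioProject without editing each command.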
Computationally using pysradb
Create pysradb environment
The pysradb tool can obtain metadata from SRA and ENA.
Here we will create an independent environment and install pysradb. We can delete this env when no longer needed. To learn more click here.
- First, we create a pysradb environment and install the pysradb tool.
- Then we use pysradb to download the SRA metadata on CLI.
conda activate base
# Create a dedicated environment with Python 3 and pysradb
conda create -c bioconda -n pysradb python=3 pysradb
conda activate pysradb
Using a bash script
#!/bin/bash
# Shell script: workflow/scripts/pysradb_sra_metadata.sh
pysradb metadata PRJNA477349 --detailed >data/metadata/PRJNA477349_pysradb.csv
pysradb metadata PRJNA802976 --detailed >data/metadata/PRJNA802976_pysradb.csv
pysradb metadata PRJNA322554 --detailed >data/metadata/PRJNA322554_pysradb.csv
pysradb metadata PRJNA937707 --detailed >data/metadata/PRJNA937707_pysradb.csv
pysradb metadata PRJNA589182 --detailed >data/metadata/PRJNA589182_pysradb.csv
Using a python script
# Python script: workflow/scripts/pysradb_sra_metadata.py
from pysradb.sraweb import SRAweb

# BioProject accessions used in this demo
bioprojects = ['PRJNA477349', 'PRJNA802976', 'PRJNA322554', 'PRJNA937707', 'PRJNA589182']

# One SRAweb client can serve every query
db = SRAweb()
for accession in bioprojects:
    # Fetch detailed metadata as a pandas DataFrame and save it as CSV
    df = db.sra_metadata(accession, detailed=True)
    df.to_csv(f'data/metadata/{accession}_pysradb_metadata.csv', index=False)
I sometimes get a ConnectionError when using the Python method. Try a different method if that happens.
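Before switching methods, it may be worth retrying the request a few times, since connection errors are often temporary. The helper below is a generic sketch and not part of pysradb; the name `with_retries` and its parameters are hypothetical. It catches `OSError`, which is the parent class of Python's built-in `ConnectionError`; whether the error raised in your setup falls under it is an assumption to verify.

```python
import time


def with_retries(fetch, attempts=3, wait_seconds=5.0, sleep=time.sleep):
    """Call fetch() until it succeeds or the attempts run out."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except OSError:  # the built-in ConnectionError is a subclass of OSError
            if attempt == attempts:
                raise  # give up and surface the last error
            sleep(wait_seconds * attempt)  # wait a little longer each time


# Example call, not executed here (needs pysradb and network access):
# df = with_retries(lambda: db.sra_metadata('PRJNA477349', detailed=True))
```

If the error persists after a few attempts, falling back to the bash script or Entrez Direct remains the simpler option.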
Example querying SRA or ENA with a keyword
Searching a large database by keyword helps narrow the results to what you need, such as studies related to a particular disease.
#!/bin/bash
mamba install -c bioconda pysradb
pysradb search --db sra -q Amplicon --max 100 >sra_amplicon_studies.csv
pysradb search --db ena -q Amplicon --max 100 >ena_amplicon_studies.csv
Exploring sample metadata
Read size
The size of the reads tells us how big the dataset is and helps to estimate the computational needs.
Top five smallest runs by read size
# A tibble: 5 × 3
  run        bioproject   bases
  <chr>      <chr>        <dbl>
1 SRR3725509 PRJNA322554    581
2 SRR901080  PRJNA208226  89298
3 SRR3725412 PRJNA322554 100541
4 SRR901106  PRJNA208226 114181
5 SRR3725402 PRJNA322554 117824
Top five largest runs by read size
# A tibble: 5 × 3
  run        bioproject      bases
  <chr>      <chr>           <dbl>
1 ERR1398210 PRJEB13870 7932024000
2 ERR1398162 PRJEB13870 7571464750
3 ERR1398085 PRJEB13870 7308113750
4 ERR1398154 PRJEB13870 7109574000
5 ERR1398136 PRJEB13870 7035274250
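Rankings like the two tables above come from sorting the merged metadata by the `bases` column. The workflow does this in R; the following standard-library Python sketch shows the same idea, assuming a CSV with `run`, `bioproject`, and `bases` columns. The function name `rank_runs` is hypothetical, and the four example rows are copied from the tables above.

```python
import csv
import io


def rank_runs(csv_text, n=5, largest=False):
    """Return the n smallest (or largest) runs by number of bases."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [(row["run"], row["bioproject"], int(row["bases"])) for row in reader]
    rows.sort(key=lambda r: r[2], reverse=largest)  # sort on the bases column
    return rows[:n]


example = """run,bioproject,bases
SRR3725412,PRJNA322554,100541
ERR1398210,PRJEB13870,7932024000
SRR3725509,PRJNA322554,581
SRR901080,PRJNA208226,89298
"""

print(rank_runs(example, n=2))
# [('SRR3725509', 'PRJNA322554', 581), ('SRR901080', 'PRJNA208226', 89298)]
```

Passing `largest=True` reverses the order and returns the biggest runs first.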
Compare read size by BioProject
Read size by variables within a project
Explore project sampling points
Dropping pins on the map is possible if you have coordinate data, that is, the latitude and longitude of the collection points. When the map is viewed interactively, hovering over a pin shows the corresponding project.
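Coordinates often need tidying before they can be plotted. The sketch below assumes the coordinates arrive as a single text field such as `6.5 S 34.25 E` (degrees followed by a hemisphere letter) and converts it to signed decimal values; the function name `parse_lat_lon` and the example coordinates are made up for illustration, and your metadata may store coordinates differently.

```python
import re

# Pattern: "<degrees> N|S <degrees> E|W", for example "6.5 S 34.25 E"
_LAT_LON = re.compile(
    r"^\s*(\d+(?:\.\d+)?)\s*([NS])\s+(\d+(?:\.\d+)?)\s*([EW])\s*$",
    re.IGNORECASE,
)


def parse_lat_lon(text):
    """Return (latitude, longitude) as signed floats, or None if unusable."""
    match = _LAT_LON.match(text or "")
    if match is None:
        return None  # missing values or free text cannot be mapped
    lat, ns, lon, ew = match.groups()
    latitude = float(lat) * (-1 if ns.upper() == "S" else 1)
    longitude = float(lon) * (-1 if ew.upper() == "W" else 1)
    if abs(latitude) > 90 or abs(longitude) > 180:
        return None  # outside the valid coordinate range
    return latitude, longitude


print(parse_lat_lon("6.5 S 34.25 E"))  # (-6.5, 34.25)
print(parse_lat_lon("not collected"))  # None
```

Samples that return `None` have no usable coordinates and can be dropped before plotting.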
References
Appendix
Project main tree
.
├── LICENSE
├── README.md
├── Rplots.pdf
├── config
│   ├── config.yml
│   ├── pbs
│   │   ├── cluster.yaml
│   │   └── config.yaml
│   ├── samples.tsv
│   ├── slurm
│   │   ├── cluster.yaml
│   │   └── config.yaml
│   └── units.tsv
├── dag.pdf
├── dags
│   ├── rulegraph.png
│   └── rulegraph.svg
├── data
│   └── metadata
│       ├── PRJEB13870_SraRunTable.csv
│       ├── PRJEB13870_tidy_metadata.csv
│       ├── PRJNA208226_SraRunTable.csv
│       ├── PRJNA208226_tidy_metadata.csv
│       ├── PRJNA322554_SraRunTable.csv
│       ├── PRJNA322554_tidy_metadata.csv
│       ├── PRJNA477349_SraRunTable.csv
│       ├── PRJNA477349_tidy_metadata.csv
│       ├── PRJNA589182_SraRunTable.csv
│       ├── PRJNA589182_tidy_metadata.csv
│       ├── PRJNA802976_SraRunTable.csv
│       ├── PRJNA802976_tidy_metadata.csv
│       ├── PRJNA937707_SraRunTable.csv
│       ├── PRJNA937707_tidy_metadata.csv
│       ├── bioproj_accessions.txt
│       ├── metadata.csv
│       └── mouse_gut.tsv
├── ena_amplicon_studies.csv
├── images
│   ├── PRJNA208226_read_size.png
│   ├── PRJNA208226_read_size.svg
│   ├── PRJNA477349_read_size.png
│   ├── PRJNA477349_read_size.svg
│   ├── analysis_creation_time.svg
│   ├── analysis_statistics.svg
│   ├── bkgd.png
│   ├── geeks.png
│   ├── gpsfiles
│   │   ├── sample_gps.html
│   │   └── sample_gps_files
│   ├── metadata.png
│   ├── read_size.png
│   ├── read_size.svg
│   ├── rule_runtime_date.png
│   ├── sample_gps.png
│   ├── smk_static_report.png
│   └── sra_run_selector.png
├── imap-sample-metadata.Rproj
├── index.Rmd
├── library
│   ├── apa.csl
│   ├── imap.bib
│   └── references.bib
├── report.html
├── resources
├── results
│   ├── project_tree.txt
│   ├── read_size_asc.csv
│   └── read_size_desc.csv
├── sra_amplicon_studies.csv
├── styles.css
└── workflow
    ├── Snakefile
    ├── envs
    │   └── environment.yml
    ├── report
    │   ├── PRJNA208226.rst
    │   ├── PRJNA477349.rst
    │   ├── gps.rst
    │   ├── readsize.rst
    │   └── workflow.rst
    ├── rules
    │   ├── create_mapping_files.smk
    │   ├── merge_bioproj_metadata.smk
    │   ├── plot_sampling_map.smk
    │   ├── plot_var_read_freq.smk
    │   ├── process_tidy_metadata.smk
    │   ├── read_size_table.smk
    │   ├── rmd_report.smk
    │   └── rules_dag.smk
    ├── schemas
    └── scripts
        ├── 00_preface_n_intro.Rmd
        ├── 01_download_sra_metadata.Rmd
        ├── 02_explore_sra_metadata.Rmd
        ├── common.R
        ├── create_mapping_files.R
        ├── download_sra_metadata.sh
        ├── entrez_sra_metadata.sh
        ├── explore_read_size.R
        ├── get_run_accessions.Rmd
        ├── get_run_accessions.py
        ├── get_runinfo_columns.R
        ├── make_samples_units.py
        ├── merge_proj_metadata.R
        ├── plot_sampling_points.R
        ├── plot_var_freq.R
        ├── process_tidy_metadata.R
        ├── pysradb_sra_metadata.sh
        ├── read_csv.py
        ├── render.R
        ├── rules_dag.sh
        ├── select_runinfo.R
        ├── smk_html_report.sh
        └── tree.sh
18 directories, 96 files

Troubleshooting and FAQs
- Question
- Question
-
Answer
-
Answer